metal : fix fa kernel #9187
Conversation
I cannot reproduce this on M3 Max, but a race condition seems like it would be the most likely cause. If that's the case, it is likely to create more issues in the future, so it would be good to fix it completely rather than making some changes that make the issue less likely, but still possible.
I found another fix with #9188, but I still don't understand why. Another observation is that the race is likely inside this block: llama.cpp/ggml/src/ggml-metal.metal, lines 2142 to 2160 at 7a3df79.
The reason is that if I change the code like this, the problem goes away:

```diff
diff --git a/ggml/src/ggml-metal.metal b/ggml/src/ggml-metal.metal
index aba0b9a0..df1be087 100644
--- a/ggml/src/ggml-metal.metal
+++ b/ggml/src/ggml-metal.metal
@@ -2141,8 +2141,9 @@ kernel void kernel_flash_attn_ext_f16(
                 simdgroup_store(mqk, ss + 8*cc, TF, 0, false);

-                const short tx = tiisg%4;
-                const short ty = tiisg/4;
+                if (tiisg == 0) {
+                for (short ty = 0; ty < 8; ++ty) {
+                for (short tx = 0; tx < 4; ++tx) {

                 // mqk = mqk*scale
                 ss[8*cc + ty*TF + 2*tx + 0] *= scale;
@@ -2157,6 +2158,10 @@ kernel void kernel_flash_attn_ext_f16(
                     // mqk = mqk + mask*slope
                     ss[8*cc + ty*TF + 2*tx + 0] += slope*mp[ic + 8*cc + ty*nb31/sizeof(half) + 2*tx + 0];
                     ss[8*cc + ty*TF + 2*tx + 1] += slope*mp[ic + 8*cc + ty*nb31/sizeof(half) + 2*tx + 1];
+                }
+
+                }
+                }
             }
         }
     }
```

Any ideas what could be the issue? I'm not sure how to debug this further.
I don't know what threads use what data, and I can't test it either. Intuitively I would assume that a barrier is necessary before the softmax to synchronize the shared memory.
It does look like a barrier is necessary, though adding one did not fix the problem. However, I rewrote the code to make the threads access the data more coherently and this resolved the problem: #9189
Superseded by #9189
cont #9159

For some reason, `test-backend-ops` fails on `master` on M2 Ultra with the latest changes from #9159. Maybe a race condition or running out of registers? Not sure.

The proposed change in this PR fixes the tests and the performance remains the same.